Abstract

Background: The identification of functionally or structurally important non-conserved residue sites in protein MSAs 
is an important challenge for understanding the structural basis and molecular mechanism of protein functions.
Despite the rich literature on compensatory mutations as well as sequence conservation analysis for the detection of 
those important residues, previous methods often rely on classical information-theoretic measures. However, these 
measures usually do not take into account dis/similarities of amino acids which are likely to be crucial for those 
residues. In this study, we present a new method, the Quantum Coupled Mutation Finder (QCMF) that incorporates 
significant dis/similar amino acid pair signals in the prediction of functionally or structurally important sites. 

Results: The result of this study is twofold. First, using the essential sites of two human proteins, namely epidermal 
growth factor receptor (EGFR) and glucokinase (GCK), we tested the QCMF method. The QCMF includes two metrics
based on quantum Jensen-Shannon divergence to measure both sequence conservation and compensatory 
mutations. We found that the QCMF reaches an improved performance in identifying essential sites from MSAs of 
both proteins with a significantly higher Matthews correlation coefficient (MCC) value in comparison to previous 
methods. Second, using a data set of 153 proteins, we made a pairwise comparison between QCMF and three
conventional methods. This comparison study strongly suggests that QCMF complements the conventional methods 
for the identification of correlated mutations in MSAs. 

Conclusions: QCMF utilizes the notion of entanglement, which is a major resource of quantum information, to 
model significant dissimilar and similar amino acid pair signals in the detection of functionally or structurally 
important sites. Our results suggest that on the one hand QCMF significantly outperforms the previous method,
which mainly focuses on dissimilar amino acid signals, to detect essential sites in proteins. On the other hand, it is
complementary to the existing methods for the identification of correlated mutations. The method of QCMF is 
computationally intensive. To ensure a feasible computation time of the QCMF's algorithm, we leveraged Compute 
Unified Device Architecture (CUDA). 

The QCMF server is freely accessible at http://qcmf.informatik.uni-goettingen.de/. 
Background

Multiple sequence alignments (MSAs) of homologous 
protein sequences give us information about two major 
features of the proteins of interest. The first one consists 
of easily detectable highly conserved residue sites that are 
obviously important for the structure and/or the func- 
tion of the protein; while the second one corresponds to 
compensatory (coupled) mutations between two or more 
residue sites that also contain crucial information on the 
structural and functional basis of proteins [1]. These com- 
pensatory mutations occur according to the functional 
coupling of mutation positions which might be explained 
as one mutation in a certain site affecting a compensat- 
ing mutation at another site, even if both related residue 
sites are distantly positioned in the protein structure 
[2-5]. In particular, such mutations at essential residue 
sites are likely to destroy protein structure which often 
results in loss of the protein function [6,7]. Thus, recog- 
nition of these residue sites is as important as the strictly 
conserved positions for the understanding of the struc- 
tural basis of protein functions and for the identification 
of functionally important residue positions [5,8,9]. 

Although the strictly conserved residue sites are easily 
detectable and interpretable in MSAs, the detection of 
important non-conserved compensatory mutation sites 
needs more complex approaches. Today, due to their
simplicity and efficiency, mutual-information-based
metrics (MI-metrics) are often used to measure the
co-evolutionary relationship between residue sites in MSAs
[4-6,10-13]. However, the MI-metrics strongly depend
on the amino acid distributions observed in the MSA 
columns rather than on physical or biochemical con- 
straints of amino acids that are likely to be crucial for 
the detection of functionally or structurally important 
compensatory mutations in a protein sequence. Further, 
according to the phylogenetic relationship of protein 
sequences and background noise, there is always a MI- 
value between each column pair in an MSA. Therefore, 
the challenging problems in bioinformatics for the detec- 
tion of significant compensatory mutation signals are: i) 
the minimization of the influence of phylogenetic rela- 
tionships of protein sequences by incorporating physical 
or biochemical properties of amino acids in the calcu- 
lation; ii) the separation of significant signals from the 
background noise or unrelated pair signals. 

In order to eliminate the influence of phylogeny and 
noise effects of MI, Dunn et al. [6] have introduced the 
average product correction (APC). Subtracting APC from 
MI, they obtained their MIp metric. However, in their 
model the reduction of background noise is not quantified.
On the other hand, Gao et al. [13] have integrated the
amino acid background distribution (MIB) in the calculation
of their MI-metric and regarded only the 25 column pairs
of each MSA with the highest normalized MI values
as significant in order to reduce noise, which seems to be
over-conservative, yet specific.
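
The average product correction described above can be sketched as follows. This is a minimal illustration and not the implementation of Dunn et al. [6]: the function names, the toy alignment and the use of base-2 logarithms are assumptions made here, and gaps as well as sequence weighting are ignored.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )


def mip_scores(msa):
    """Return {(a, b): MIp} with MIp = MI - APC for all column pairs a < b."""
    columns = list(zip(*msa))  # transpose rows into columns
    m = len(columns)
    mi = {
        (a, b): mutual_information(columns[a], columns[b])
        for a, b in combinations(range(m), 2)
    }
    # Mean MI of each column with all other columns.
    col_mean = [
        sum(mi[min(a, b), max(a, b)] for b in range(m) if b != a) / (m - 1)
        for a in range(m)
    ]
    overall = sum(mi.values()) / len(mi)  # mean MI over all column pairs
    return {
        (a, b): v - (col_mean[a] * col_mean[b] / overall if overall > 0 else 0.0)
        for (a, b), v in mi.items()
    }


# Hypothetical toy alignment: columns 0 and 1 co-vary perfectly.
toy_msa = ["ACDA", "ACEA", "GTDC", "GTEC", "ACDG", "GTEG"]
scores = mip_scores(toy_msa)
best_pair = max(scores, key=scores.get)
```

In this toy alignment the pair of columns 0 and 1 receives the highest MIp score, because its raw MI exceeds what the column-wise background averages would predict.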

Large efforts have been made in the last few years to 
improve local-correlation-measure-based approaches to 
residue co-evolution when it comes to modeling effects 
that rely on spatial proximity (see [14] for an overview). In 
this case, it is necessary to disentangle direct and indirect 
correlations. Classical mutual information, for example,
is high not only if the two sites under study are close in
3D space. On the contrary, any local measure of correlation,
not just mutual information, is limited by the transitivity
effect: if site A is coupled to site B and site B to site C,
then A and C appear correlated even without direct contact.

To overcome this problem, global statistical models of 
protein families are employed. The direct-coupling analysis
(DCA) works as follows. Maximizing the entropy
subject to preserving the single and pair residue fre- 
quencies observed, a joint probability distribution on all 
possible members of the protein family is derived. Utiliz- 
ing this distribution, considerable progress in predicting 
residue-residue contacts in 3-dimensional protein struc- 
tures was made [15-17]. Protein Sparse Inverse Covari- 
ance (PSICOV) [18] achieves disentanglement of direct 
and indirect correlations by inverting a residue-residue 
covariance matrix. In [19] further progress was made by 
integrating structural context and sequence co-evolution 
information. 

There is merely a small number of methods that incor- 
porate amino acid similarity in the prediction of func- 
tionally or structurally important sites. In this context, 
it is natural to partition the amino acids into chemically 
similar groups before applying an information-theoretic 
measure like the Shannon entropy [20,21]. It was reported 
that many other methods fail to outperform this simple 
partition approach [22]. However, quantum information 
theory supplies a well-studied and powerful framework 
to integrate such similarity, where the classical Shannon 
entropy is swapped for the von Neumann entropy (VNE). 
Caffrey et al. [23] and Johansson et al. [24] first
introduced VNE to multiple sequence alignment analysis,
although they did not treat amino acid pair similarity.
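
The partition approach mentioned above can be illustrated with a short sketch. The grouping below is one common physico-chemical partition chosen here for illustration; the groups used in [20,21] may differ, and the function names are hypothetical. Columns are assumed to contain only the 20 standard amino acids (no gaps).

```python
import math
from collections import Counter

# Assumed partition of the 20 amino acids into chemically similar groups.
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def grouped_entropy(column):
    """Entropy of a column after mapping each residue to its group."""
    return shannon_entropy([RESIDUE_TO_GROUP[aa] for aa in column])


column = "LIVLIV"  # hypothetical column with three similar residues
plain = shannon_entropy(column)    # treats L, I and V as unrelated symbols
grouped = grouped_entropy(column)  # treats L, I and V as one group
```

The column looks variable to the plain Shannon entropy but fully conserved once the residues are grouped, which is the effect the partition approach exploits.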

Recently, a new method called Coupled Mutation Finder 
(CMF) has been introduced by Gültas et al. [5] to deal
with phylogenetic noise as well as background signals and 
to quantify the error made in terms of the false discovery 
rate. The CMF method only focuses on BLOSUM62- 
dissimilar amino acid pairs as a model of compensatory 
mutations and integrates them in the calculation of
normalized MI-metrics using a doubly stochastic matrix to
transform the empirical pair distribution of the column 
pair. However, the CMF disregards amino acid pair sim- 
ilarity which can be also crucial for the detection of 
functionally or structurally important sites in MSAs. 

In this study, we present a new method called Quantum
Coupled Mutation Finder (QCMF) which extends the
CMF algorithm [5] by additionally incorporating amino 
acid pair similarity. To this end, the QCMF invokes
principles from quantum information theory, in particular
quantum entanglement, a major resource of quantum
information that is used here for the first time in the
context of MSA analysis. Amino acid pair distributions are
replaced by entangled density matrices from quantum mechanics which
encompass in our case both empirical pair distributions, 
possibly transformed by the doubly stochastic matrix used 
in [5], and pair similarity. Following Capra and Singh [22] 
who pointed out that it is hard to improve upon metrics 
based on Jensen-Shannon divergences, we quantify the 
effect of both amino acid pair similarity and amino acid 
pair dissimilarity by the quantum Jensen-Shannon diver- 
gence between an entangled density matrix and the one 
that simply represents the amino acid pair frequencies. 
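
To make the quantum Jensen-Shannon divergence concrete, the following sketch evaluates it for 2x2 real symmetric density matrices, for which the eigenvalues have a closed form. It is a toy illustration over a hypothetical two-letter alphabet, not the QCMF metrics themselves (those are defined in the Methods section); the matrices and function names are assumptions made here.

```python
import math


def eigenvalues_2x2(m):
    """Eigenvalues of a real symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = m
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius


def von_neumann_entropy(rho):
    """S(rho) = -sum(lambda * log2(lambda)) over the positive eigenvalues."""
    return -sum(l * math.log2(l) for l in eigenvalues_2x2(rho) if l > 1e-12)


def qjsd(rho, sigma):
    """Quantum Jensen-Shannon divergence S((rho+sigma)/2) - (S(rho)+S(sigma))/2."""
    mix = [[(r + s) / 2 for r, s in zip(row_r, row_s)]
           for row_r, row_s in zip(rho, sigma)]
    return von_neumann_entropy(mix) - (
        von_neumann_entropy(rho) + von_neumann_entropy(sigma)) / 2


# Diagonal matrix: plain frequencies of two symbols, no similarity modelled.
classical = [[0.5, 0.0], [0.0, 0.5]]
# Same frequencies, with an off-diagonal term standing in for similarity.
with_similarity = [[0.5, 0.4], [0.4, 0.5]]

divergence = qjsd(classical, with_similarity)
```

The off-diagonal entries lower the von Neumann entropy relative to the diagonal matrix, so the divergence between the two matrices is positive even though their diagonals (the frequencies) coincide.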

The QCMF algorithm relies heavily on matrix operations
that are computationally intensive. When analyzing a single
MSA, the computational time of these matrix operations
rises very quickly due to the huge number of column pairs.
In order to reduce the running time of the QCMF, we
implemented its algorithm using Compute Unified Device
Architecture (CUDA). CUDA is an efficient parallel computing
architecture developed by NVIDIA that utilizes graphics
processing units (GPUs) for general-purpose scientific and
engineering applications
[25]. Nowadays, GPUs are often used for computation- 
ally challenging problems in bioinformatics [26-29] and 
several other scientific fields [30-32]. 

Results 

Our main focus in this study was to investigate whether 
quantum-information-theory-based measures could contribute
beyond conventional measures to the identification
of important residue sites. The Results section
of this work is twofold. First, to test the functionality of
QCMF-significant individual residue sites we analysed the 
essential sites of two human proteins: epidermal growth 
factor receptor (EGFR) (pdb entry 2J6M) and glucokinase 
(GCK) (pdb entry 1V4S). The functionally and struc- 
turally important sites of both proteins have been experi- 
mentally investigated in several studies previously [33-44] 
and their positions were summarized in [5] as essential 
sites. The essential sites of these proteins consist of sev- 
eral non-conserved residue sites which are directly located 
at or near disease associated amino acid mutation (non- 
synonymous single nucleotide polymorphisms (nsSNPs)) 
sites, catalytic sites, protein binding sites and so on, each
of which is likely to affect protein stability or functionality
(see [5] and references therein). In addition, residue
sites are defined to be in contact according to the "nearby" 
definition of Nussinov et al. [45] if their carbon major
atoms have a distance of less than or equal to 6 Å.
Consequently, we defined an individual QCMF-significant
residue site as "functionally or structurally important" if it 
corresponds to one of these essential sites. 

Second, to further investigate the performance of 
QCMF and to make a comparison with the previous meth- 
ods (CMF [5], MIp [6], and PSICOV [18]), we selected 
a non-redundant set of proteins prepared by Janda 
et al. [46]. Although the dataset contains 216 proteins, we 
eliminated a few proteins due to inconsistency between 
corresponding MSAs and PDB files, so that we finally 
ended up with a dataset of 153 proteins (see Additional 
file 1). 

The MSAs for each protein, which contain after fil- 
tering at least 125 independent sequences, were derived 
from the HSSP-database [47] that merges primary struc- 
ture information and tertiary structure information of 
proteins. 

Finally, we define QCMF-significant sites as follows. Let 
M be an MSA, with the protein of interest being the first 
row of M. A site pair as well as an individual site of the 
protein are said to be QCMF-significant with respect to
the MSA M, if they are (Qent, M)-significant or
(Qsep, M)-significant. The latter two notions and the underlying two
co-evolutionary column pair metrics Qent and Qsep are
defined in the Methods section. If the MSA M is fixed, we
speak of Qent-significance and Qsep-significance, rather
than of (Qent, M)-significance and (Qsep, M)-significance,
respectively.

QCMF-significant residue sites in the Human Epidermal 
Growth Factor Receptor (EGFR) protein 

Using the MSA-specific statistical model with a false
discovery rate (FDR) of 1% for both QCMF-metrics, we first
determined altogether 2688 out of 26079 non-conserved
column pairs as significant in the corresponding MSA of the
human EGFR protein. 631 of these significant pairs were
detected by the Qent-metric, and 2149 pairs were detected
by the Qsep-metric. Only 92 significant column pairs were
detected by both metrics. After that, utilizing the
connectivity degree technique, we predicted in total 33 residue
sites in the corresponding sequence of the human EGFR protein
as QCMF-significant (see Additional file 2). 12 of them
are only Qent-significant and 18 residue sites are only
Qsep-significant; the remaining 3 residue sites (A839, A882 and
V902) are both Qent-significant and Qsep-significant.
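
The totals reported above follow from the per-metric counts by inclusion-exclusion. The short check below only restates numbers given in the text; the helper name is hypothetical.

```python
def union_size(n_first, n_second, n_both):
    """Size of the union of two sets from their sizes and their overlap."""
    return n_first + n_second - n_both


# Column pairs: 631 detected by Qent, 2149 by Qsep, 92 by both metrics.
egfr_significant_pairs = union_size(631, 2149, 92)

# Residue sites: 12 only Qent, 18 only Qsep, 3 significant for both metrics.
egfr_significant_sites = 12 + 18 + 3
```

Both results agree with the totals stated in the text (2688 column pairs and 33 residue sites).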

Ten of the QCMF-significant residue sites are in contact
with either catalytic residues or critical active site
regions of the gefitinib binding site in the wild-type EGFR kinase
[34,37,48] (see Figure 1 and Figure 2). Among these sites,
A839 and R841 have been verified as catalytic residue
sites through the Catalytic Site Atlas [48]. T854 is itself a
gefitinib binding site, and the residue sites V845
and A859 are also in contact with the nsSNP positions K846,
Figure 1 QCMF-significant residue positions are in contact with catalytic residues in human EGFR protein (PDB-Entry 2J6M). Red spheres
denote positions of the catalytic residues. Yellow spheres show the localization of significant adjacent residue positions found by QCMF which are in 
contact with these catalytic residues. Moreover, the QCMF-significant sites A839 and R841 are also catalytic residues by themselves. Green spheres 
show the structural localization of nsSNP positions found by QCMF as significant in the EGFR protein. The circles indicate clusters of catalytic residue 
sites and their significant adjacent sites. 



T847 and K860 in the human EGFR protein. Moreover, two
out of all 33 significant sites are related to disease-associated
nsSNP positions, and their structural localization is
illustrated in Figure 1.

Additionally, 13 out of all QCMF-significant sites are
referred to as essential sites, each of which is near
either strictly conserved residues or nsSNPs (see Table 1).

According to the essential sites of the human EGFR protein
published in [5], we have thus shown the structural
or functional importance of 25 QCMF-significant sites.
The remaining 8 significant residue sites (G729, T851,
G779, Q820, M825, L927, G930, Y944) do not fall into the
essential sites, and the reason for their significance and
their importance in the EGFR protein is currently unclear.

QCMF-significant residue sites in the Human Glucokinase 
(GCK) protein 

As for the human EGFR protein, applying the MSA-specific
statistical model with an FDR of 1% for both QCMF-metrics,
we identified a total of 9853 out of 69645 non-conserved
column pairs as significant in the human GCK protein
(pdb entry 1V4S). 6070 of them were (Qent, M)-significant
and 4232 were detected as (Qsep, M)-significant. Only 449
column pairs were detected as significant with respect to
both metrics. Thereupon, using the connectivity degree
technique, we determined altogether 64 residue sites
in the human GCK protein as QCMF-significant (see
Additional file 3). 30 of them are determined as
Qent-significant and a further 30 significant residue sites are
determined as Qsep-significant. Only four residue sites
(T82, G223, V253, and G407) are significant based on both
metrics.

13 of the QCMF-significant sites are in contact with the
allosteric sites V62, R63, M210, I211, Y214, Y215,
M235, V452, V455 and A456 in the human GCK protein.
Among these significant sites, V62, M210 and
Y215 are allosteric sites themselves [41], and
T209M, G223S and S453del are related to disease-associated
nsSNP positions. In addition, there are five further
QCMF-significant sites (F123L, G162D, G175R,
Figure 2 QCMF-significant residue positions are in contact with gefitinib binding sites in human EGFR protein (PDB-Entry 2J6M). Red
spheres show the structural localization of the gefitinib binding sites in the wild type kinase. Yellow spheres show QCMF-significant adjacent residue
positions which are in contact with these binding sites. Moreover, the QCMF-significant site T854 is also a binding site by itself and interacts with
gefitinib binding site D855. The circles indicate clusters of gefitinib binding sites and their significant adjacent sites.



Table 1 QCMF-significant essential sites in the human
EGFR protein, which are nearby either nsSNPs or strictly
conserved sites

QCMF-significant    Nearby nsSNPs or strictly    Reference
essential sites     conserved sites
----------------    -------------------------    ----------
N771                773'                         [44]
G824                773'                         [44]
Y827                829'                         [44]
L828                829'                         [44]
V834                835', 836', 860'             [44,49]
Y891                892', 895'                   [44]
A822                861'                         [43,49,50]
V844                796', 798', 852'
A882                884', 895', 898'
Y900                898', 901'
V902                880', 901'
T909                906', 936'
G911                906'

': non-synonymous SNP site; ^: strictly conserved site.



T228M, and E300K,Q) that have been verified as
nsSNP positions through annotation databases and previous
experimental studies [38-40,42,43,51]. The structural
localization of these 18 QCMF-significant sites
(contact sites and nsSNP positions) is illustrated in
Figure 3.

Additionally, eight significant sites (T149, G170, F171,
T206, V207, A208, Q287 and G294) are in contact with the
glucose binding sites (active sites) T168, K169, D204, D205
and E290 in the human GCK protein [41] (see Figure 4), where
V207 and A208 are also in contact with the allosteric sites
M210 and I211.

Moreover, we have also observed that 38 QCMF- 
significant sites are further included in essential sites since 
they are nearby nsSNPs or strictly conserved residues in 
human GCK protein (see Table 2). 

In total, we have demonstrated here that according to 
the essential sites of GCK, 62 out of 64 QCMF-significant
sites are functionally or structurally important for human 
GCK protein. The remaining two significant residue sites 
V89 and N283 do not overlap with essential sites and the 
reason for their significance and their role in the GCK 
protein is still unclear. 



Figure 3 QCMF-significant positions that are either in contact with allosteric sites or related to nsSNPs in human GCK protein (PDB-Entry
1V4S). Yellow spheres correspond to the structural localization of ten significant residue sites which are in contact with allosteric sites, where V62, M210,
and Y215 are allosteric sites themselves and are also in contact with other allosteric sites. Green spheres indicate eight
significant nsSNP positions in the GCK protein. Three of them (T209M, G223S and S453del) are further in contact with the allosteric sites M210, I211,
V452, V455 and A456.



Individual residue site comparison between
QCMF-significant sites and previous CMF-significant sites

We compared QCMF-significant residue sites for both 
human EGFR and GCK proteins with the significant 
residue sites given in [5] of the previous CMF-method. 
The CMF method detected 43 sites in EGFR and
72 sites in GCK as significant.

For the EGFR protein we found that the QCMF- 
significant residue sites Q791, Q820, G824, K860, Y891, 
T892, Y900, T909 overlap with results of the CMF- 
method. Interestingly, one of the unconfirmed residue 
sites, the Q820, has been predicted by both QCMF- 
method and CMF-method as significant. 

For GCK protein, we observed that in total 24 QCMF- 
significant sites (T60, T82, N83, F123, F148, T149, F152, 
H156, F171, N180, T206, T209, T228, E236, G260, L271, 
S281, N283, Q287, G294, E300, T332, F419 and E443) 
were also determined by the CMF method as significant.
Although both methods detected residue site N283
as significant, it currently corresponds to one of the
unconfirmed residue sites for GCK.

The CMF has been developed using normalized mutual 
information (MI) measures in order to detect important 
residue positions in MSAs. The method mainly focuses on 
significant BLOSUM62-dissimilar amino acid signals as 
a model of compensatory mutations and integrates them 
in the calculation of normalized MI-metrics. As a consequence
of mainly taking into account dissimilar amino
acid signals, an important part of CMF-significant sites 
were verified as disease associated nsSNP positions and 
just a small part of them were located at or near the 
catalytic sites, allosteric sites and binding sites in both 
proteins. 

Moreover, when statistically evaluating both methods,
we have observed that the QCMF significantly
outperforms the CMF method. The QCMF reaches
an improved performance in identifying essential sites
from MSAs of both proteins with a significantly higher
Figure 4 QCMF-significant residue positions are in contact with glucose binding site in human GCK protein (PDB-Entry 1V4S). (A) Red
spheres show the structural positions of the glucose binding sites (active sites) and yellow spheres show the localization of significant adjacent
residue positions found by QCMF which are in contact with these active sites. The circles indicate clusters of glucose binding sites and their
significant adjacent sites.



Matthews correlation coefficient (MCC) value of 0.215 
whereas the CMF reaches only a MCC value of 0.133. 
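
For reference, the Matthews correlation coefficient is computed from the four entries of a confusion matrix as sketched below. The counts in the example are hypothetical and chosen only to exercise the formula; they are not the confusion matrices behind the values 0.215 and 0.133.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal sum is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical example: predicted significant sites vs. essential sites.
example_mcc = matthews_corrcoef(tp=25, fp=8, tn=150, fn=60)
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it suitable for the unbalanced case where essential sites are a minority of all sites.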

Significant residue pair comparison 

To analyze whether the quantum-information-theory-based
measures proposed in this study complement the
conventional methods for the detection of correlated
(co-evolutionary) mutations, we made pairwise comparisons
between our new QCMF, MIp [6], PSICOV [18], and
CMF [5].

All four methods take as input an MSA M satisfying certain
admissibility criteria. The problem is that QCMF
and CMF output the set of QCMF-significant sites and
CMF-significant sites of M's reference protein, respectively,
whereas PSICOV and MIp result in sets of important
residue pairs. To make these outputs comparable, we
extend them in all cases.

Let $V_{\mathrm{QCMF}}$ denote the output of QCMF on any admissible
MSA M. We extend this set to what we call the QCMF-significant
residue network $\mathcal{N}_{\mathrm{QCMF}} := (V_{\mathrm{QCMF}}, \mathcal{E}_{\mathrm{QCMF}})$
of M as follows. Any two elements of $V_{\mathrm{QCMF}}$ are
connected by an undirected edge belonging to $\mathcal{E}_{\mathrm{QCMF}}$
if and only if the corresponding column pair is QCMF-significant.

The CMF-significant residue network $\mathcal{N}_{\mathrm{CMF}}$ is
analogously defined.

In order to get a sufficiently large number of MIp-significant
and PSICOV-significant residue pairs, for
every input MSA we simply took the top-ranking
10% as MIp-significant and PSICOV-significant,
respectively.

We then utilized the connectivity degree technique in
the same way as we did for CMF and QCMF to calculate
the set of MIp-significant sites $V_{\mathrm{MIp}}$ and the set of
PSICOV-significant sites $V_{\mathrm{PSICOV}}$.

For all four methods we used the 90th, the 95th and the 
99th percentile as cut-off values. 
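
The selection of significant sites by a connectivity degree cut-off can be sketched as follows. The sketch rests on two assumptions that are not spelled out in this section: the connectivity degree of a site is taken to be the number of significant pairs it participates in, and the percentile is computed with the nearest-rank rule. The precise procedure is the one defined in [5]; all names and the toy data are hypothetical.

```python
from collections import Counter


def connectivity_degrees(significant_pairs):
    """Count, for every site, the significant pairs it belongs to."""
    degree = Counter()
    for a, b in significant_pairs:
        degree[a] += 1
        degree[b] += 1
    return degree


def nearest_rank_percentile(values, q):
    """q-th percentile (integer q in 1..100) using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def significant_sites(significant_pairs, q=90):
    """Sites whose connectivity degree reaches the q-th percentile."""
    degree = connectivity_degrees(significant_pairs)
    cutoff = nearest_rank_percentile(list(degree.values()), q)
    return {site for site, d in degree.items() if d >= cutoff}


# Toy network: site 5 is a hub, sites 1 and 2 share one extra pair.
toy_pairs = [(5, 1), (5, 2), (5, 3), (5, 4), (5, 6),
             (5, 7), (5, 8), (5, 9), (5, 10), (1, 2)]
sites_90 = significant_sites(toy_pairs, q=90)
sites_99 = significant_sites(toy_pairs, q=99)
```

Raising the cut-off from the 90th to the 99th percentile shrinks the selected set to the hub alone, which mirrors the way stricter cut-offs reduce the number of edges in the residue networks.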

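Finally, the edge sets $\mathcal{E}_{\mathrm{MIp}}$ and $\mathcal{E}_{\mathrm{PSICOV}}$ were determined
in full analogy with the calculation of $\mathcal{E}_{\mathrm{QCMF}}$ and
$\mathcal{E}_{\mathrm{CMF}}$. Thus we obtained the MIp-significant residue network
$\mathcal{N}_{\mathrm{MIp}}$ and the PSICOV-significant residue network
$\mathcal{N}_{\mathrm{PSICOV}}$.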
Table 2 QCMF-significant essential sites in the human GCK
protein, which are nearby either nsSNPs or strictly
conserved sites

QCMF-significant    Nearby nsSNPs or strictly
essential sites     conserved sites
----------------    ------------------------------
M37                 36», 39», 40»
S76                 147"
L79                 78', 80', 150'
T82                 81'
N83                 81', 108', 110'
V86                 85', 106»
S127                130'
F148                147', 150"
F152                150", 151'
P153                154'
H156                154'
A176                119', 175'
G178                164'
N180                162', 182'
L185                182', 188'
A201                147', 453'
M202                147', 203'
A232                223', 231'
C233                223', 234', 235'
V253                234', 254"
F260                257-, 258', 259', 261'
L271                274"
V277                274", 278', 279'
S281                278", 279'
Y297                291", 295', 299', 300'
M298                295", 299', 300'
T332                295", 299'
V374                377"
A378                377", 382'
A379                377", 382'
S383                382', 385'
A384                382', 385'
A387                385'
S388                385', 392'
V412                226', 227', 410', 414', 416'
F419                416'
E443                444", 445', 447'
G446                444", 445', 447', 448', 449'

': non-synonymous SNP site; ": strictly conserved site.

Reference (entries in printed order): [38,39,43,51]; [43,51]; [38]; [40];
[38,39,43,51]; [39,43,51]; [39]; [39]; [43]; [38,39,43], [39,43,51]; [43];
[39,40,51]; [39,40,51]; [39,43]; [43]; [43]; [43]; [43]; [43]; [43]; [43];
[43]; [43]; [38,43]; [40,43]; [40]; [39]; [39]



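We performed the method comparison edge-oriented, with
the number of overlapping edges as measure. We applied
all four methods to the 153 MSAs (see Additional file 1)
described at the very beginning of this section and
calculated the numbers

\[
|\mathcal{E}^{(i)}_{\mathrm{QCMF}}|,\quad
|\mathcal{E}^{(i)}_{\mathrm{CMF}}|,\quad
|\mathcal{E}^{(i)}_{\mathrm{PSICOV}}|,\quad
|\mathcal{E}^{(i)}_{\mathrm{MIp}}|,
\]
\[
|\mathcal{E}^{(i)}_{\mathrm{QCMF}} \cap \mathcal{E}^{(i)}_{\mathrm{PSICOV}}|,\quad
|\mathcal{E}^{(i)}_{\mathrm{QCMF}} \cap \mathcal{E}^{(i)}_{\mathrm{MIp}}|,\quad
|\mathcal{E}^{(i)}_{\mathrm{QCMF}} \cap \mathcal{E}^{(i)}_{\mathrm{CMF}}|,
\]
\[
|\mathcal{E}^{(i)}_{\mathrm{MIp}} \cap \mathcal{E}^{(i)}_{\mathrm{PSICOV}}|,\quad
|\mathcal{E}^{(i)}_{\mathrm{MIp}} \cap \mathcal{E}^{(i)}_{\mathrm{CMF}}|,\quad\text{and}\quad
|\mathcal{E}^{(i)}_{\mathrm{PSICOV}} \cap \mathcal{E}^{(i)}_{\mathrm{CMF}}|
\]

on each of them, where the connectivity cut-off ranges over
the 90th, the 95th and the 99th percentile, and
i = 1, 2, ..., 153. Summing up the 153 numbers in each of
these groups results in the numbers

\[
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{QCMF}}|,\quad
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{CMF}}|,\quad
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{PSICOV}}|,\quad
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{MIp}}|,
\]
\[
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{QCMF}} \cap \mathcal{E}^{(i)}_{\mathrm{PSICOV}}|,\quad
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{QCMF}} \cap \mathcal{E}^{(i)}_{\mathrm{MIp}}|,\quad
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{QCMF}} \cap \mathcal{E}^{(i)}_{\mathrm{CMF}}|,
\]
\[
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{MIp}} \cap \mathcal{E}^{(i)}_{\mathrm{PSICOV}}|,\quad
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{MIp}} \cap \mathcal{E}^{(i)}_{\mathrm{CMF}}|,\quad\text{and}\quad
\sum_{i=1}^{153} |\mathcal{E}^{(i)}_{\mathrm{PSICOV}} \cap \mathcal{E}^{(i)}_{\mathrm{CMF}}|,
\]

which are displayed in Tables 3 and 4.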

Table 3 shows that all methods detect with the same 
connectivity degree cut-off a comparable number of edges 
in the corresponding significant residue network. 

Table 4 strongly suggests that all four methods carry
distinct information. The overlap between any two of
them is less than or equal to 10%. This indicates that,
under the assumption that each of them models important
aspects of co-evolution, they complement each other.
In particular, this is true for QCMF as a
quantum-information-science-based service compared with the
other three established tools that are based on
conventional methods.
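
The edge-oriented comparison can be sketched with plain sets. Undirected edges are stored as frozensets so that (a, b) and (b, a) coincide. The text does not state what the overlap percentage is measured against; normalizing by the smaller edge set is an assumption made here, and the toy edge lists are hypothetical.

```python
def edge_set(pairs):
    """Undirected edge set: each edge is a frozenset of its two endpoints."""
    return {frozenset(pair) for pair in pairs}


def overlap_fraction(edges_a, edges_b):
    """Common edges relative to the smaller of the two edge sets."""
    smaller = min(len(edges_a), len(edges_b))
    return len(edges_a & edges_b) / smaller if smaller else 0.0


# Hypothetical edge lists of two methods on the same MSA.
qcmf_edges = edge_set([(1, 2), (2, 3), (3, 4), (4, 5)])
psicov_edges = edge_set([(2, 1), (5, 6), (6, 7)])

common = qcmf_edges & psicov_edges
fraction = overlap_fraction(qcmf_edges, psicov_edges)
```

Summing `len(common)` over all MSAs gives the kind of total reported in Table 4, while summing the sizes of the individual edge sets gives the totals of Table 3.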

Implementation of QCMF: Parallel computing using CUDA 

The computation of both QCMF metrics (Equations 7
and 8) relies heavily on matrix operations. Therefore,
we implemented the QCMF algorithm using CUDA [25], which
is well suited to performing a large number of vector and
matrix operations in real time. This results in a dramatic
reduction of the computational time of QCMF.

In this study, we use the CUDA 4.0 architecture
(Toolkit) together with several linear algebra libraries,
namely MAGMA [52], LAPACK [53], BLAS [54], GotoBLAS [55] and
CUBLAS [25] (see Figure 5), to reduce the running
time of the QCMF algorithm. Since our program
requires cooperative multithreading that avoids
asynchronicity and locks, we extended the MAGMA library
with dynamic scheduling features according to [56]. Further,
in order to be able to compare the performance, we
also implemented the QCMF algorithm on the CPU
architecture alone. Both implementations were run on
an Intel Core™ i7-3770K processor operating at 3.9 GHz,
with 16 GB of DDR3 RAM and a GeForce GTX 680
Table 3 Total number of edges in method-dependent significant residue networks with respect to various connectivity
degree cut-offs

Total number of edges in significant residue networks
Connectivity degree cut-off: 90th percentile, 95th percentile, 99th percentile
graphics card using the Ubuntu 13.04 operating system
(64-bit version).

Applying the QCMF algorithm to the human EGFR protein
with CPU alone and with CUDA acceleration, the
average computational time per column pair was 0.7117
seconds and 0.0301 seconds, respectively. Similarly, for the
human GCK protein, the average computational time per
column pair was 0.6977 seconds with CPU alone and
0.0299 seconds with CUDA acceleration. Consequently,
the algorithm took ~310 minutes for the human EGFR protein
and ~811 minutes for the GCK protein with CPU alone.
With CUDA acceleration, on the other hand, it
took only ~13 minutes for EGFR and ~39 minutes for
the GCK protein. The comparison between the average times
indicates that the QCMF algorithm ran more than 23 times
faster with CUDA acceleration than with CPU alone.
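
The speed-up factor follows directly from the per-pair averages, as the short calculation below shows. Multiplying the averages by the numbers of non-conserved column pairs reported in the Results section gives rough estimates of the total running times; that the reported totals were obtained from exactly these pair counts is an assumption made here, so the estimates need not match the stated minutes exactly.

```python
# Per-pair averages (seconds) and column pair counts taken from the text.
timings = {
    "EGFR": {"pairs": 26079, "cpu": 0.7117, "cuda": 0.0301},
    "GCK": {"pairs": 69645, "cpu": 0.6977, "cuda": 0.0299},
}


def summarize(t):
    """Speed-up factor and estimated total running times in minutes."""
    return {
        "speedup": t["cpu"] / t["cuda"],
        "cpu_minutes": t["pairs"] * t["cpu"] / 60,
        "cuda_minutes": t["pairs"] * t["cuda"] / 60,
    }


summary = {protein: summarize(t) for protein, t in timings.items()}
```

For both proteins the ratio of the per-pair averages lies between 23 and 24, consistent with the speed-up stated in the text.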

Methods 

We predict important sites of a protein by detect- 
ing co-evolving residues. Our measures of co-evolution 
are quantum-Jensen-Shannon-divergence-based metrics 
of column pairs of a multiple sequence alignment, with the 
protein under study being the reference row. The quantum 
Jensen-Shannon divergence in turn has the von Neumann 
entropy as main building block. 

The von Neumann entropy was originally defined in the 
framework of quantum mechanics. We elucidate it in the 
subsequent section as far as it is necessary to understand
our methods. Researchers interested in learning more are
referred to the excellent textbook by Vedral [57]. A
comprehensive reference book was published by Nielsen
and Chuang [58].

This section is organized as follows. In the first four 
subsections we recapitulate techniques developed in [5] 
which we leverage in this study. This concerns the defini- 
tion of significant site pairs and of significant individual 
sites, the preparation of the training data set used, and the 
computation of a doubly stochastic matrix D as our model
of compensatory mutations on grounds of two counting
matrices $C_{\mathrm{alt}}$ and $C_{\mathrm{null}}$. These two matrices also form the
basis of the two amino acid pair similarity matrices $A_{\mathrm{ent}}$
and $A_{\mathrm{sep}}$, which in turn give rise to our new
quantum-information-science-based metrics Qent and Qsep. The
last four subsections are dedicated to their definitions.

Significant column pairs and significant position with 
respect to a certain metric 

Let M be an MSA, where the protein of interest is represented
by M's first row, and let E be a metric which
assigns to every MSA column pair $(Y_1, Y_2)$ a real number
$E(Y_1, Y_2) \in [0, 1]$. We call E a co-evolutionary column
pair metric if it models a biologically meaningful
co-evolutionary signal: The larger the metric value on
$(Y_1, Y_2)$, the more likely co-evolution between position $Y_1$
and position $Y_2$ has occurred.

Let $\hat{p}(i,j)$ be the empirical relative frequency of the amino acid pair
formed by the $i$-th and the $j$-th amino acid in column pair
$(Y_1, Y_2)$, where $i,j = 1, 2, \dots, 20$. (When a row of
this column pair is chosen by pure chance, amino acid pair $(i,j)$ is drawn
with probability $\hat{p}(i,j)$.) In the remainder of this subsection we
recapitulate the way developed in [5] to identify significant
columns and significant column pairs with respect to $E$.

Table 4 Total number of edges in two networks of different type with respect to various connectivity degree cut-offs

Total number of common edges in two networks of different type

Figure 5 Linking of the CUDA environment using C++.

A well-studied example (see [5,12]) of a co-evolutionary
column pair metric is the normalized mutual information

$$U(Y_1, Y_2) := 2 \cdot \frac{H(Y_1) + H(Y_2) - H(Y_1, Y_2)}{H(Y_1) + H(Y_2)}, \qquad (1)$$

where $H(Y_1, Y_2)$, $H(Y_1)$, and $H(Y_2)$ denote the Shannon
entropy of the empirical pair distribution $(\hat{p}(i,j))_{i,j=1,2,\dots,20}$
of the column pair $(Y_1, Y_2)$ and of its two marginals.
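As a minimal sketch of Equation 1, the following Python fragment computes the normalized mutual information of two alignment columns given as strings. The function names, the toy columns, and the convention of returning 0 when both columns are strictly conserved (where the denominator vanishes) are illustrative assumptions and are not taken from the original implementation.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def normalized_mutual_information(column1, column2):
    """U(Y1, Y2) = 2 * (H(Y1) + H(Y2) - H(Y1, Y2)) / (H(Y1) + H(Y2))."""
    rows = list(zip(column1, column2))
    n = len(rows)
    # Empirical pair distribution and its two marginals.
    h12 = shannon_entropy(c / n for c in Counter(rows).values())
    h1 = shannon_entropy(c / n for c in Counter(column1).values())
    h2 = shannon_entropy(c / n for c in Counter(column2).values())
    if h1 + h2 == 0:
        # Assumed convention: two strictly conserved columns carry no signal.
        return 0.0
    return 2 * (h1 + h2 - h12) / (h1 + h2)


# Hypothetical toy columns: perfectly coupled versus independent residues.
coupled = normalized_mutual_information("AAKK", "DDEE")
independent = normalized_mutual_information("AAKK", "DEDE")
```

In this toy case the coupled columns reach the maximal value 1, whereas the independent columns yield 0.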

In order to identify significant column pairs of the MSA
under study with respect to the metric $E$, we pointed out
in [5] that the distribution of $E$ can be regarded as
a mixture of a background $\beta$-distribution $F_0$, an unrelated
pair distribution $G_1$, and a distribution $G_2$ of presumably
co-evolving pairs.

The $p$-values $1 - F_0(E)$ are then uniformly distributed
over $[0, 1]$, given that the underlying $E$-values are
$F_0$-distributed. In contrast, $p$-values tend to zero or one
if the $E$-values are $G_2$-distributed or $G_1$-distributed,
respectively.

If, moreover, there is a sub-interval of $[0, 1]$ which contains
only data from the background distribution, an
MSA-dependent threshold for $E$-values can be determined on
grounds of a result due to Storey and Tibshirani [59,60],
as we did in [5]. A column pair is said to be $(E, M)$-significant if its
$E$-value is above the threshold, where the false discovery
rate is bounded by a predefined constant.



Figure 6 is a typical pictorial representation of met- 
ric distributions which can be treated that way to detect 
significant pairs. 

We applied that model in this study. 

We utilized the connectivity degree technique, introduced
in [12] and developed further in [5], in order
to define the $(E, M)$-significance of individual residue
sites. The connectivity degree of a position $Y_1$ is the
number of positions $Y_2$ such that the site pair $(Y_1, Y_2)$ is
$(E, M)$-significant. A site of the protein of interest is then
called $(E, M)$-significant if its connectivity degree
exceeds the cut-off, which is set to the 90th percentile.
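The connectivity degree criterion can be sketched as follows. This is a minimal illustration under stated assumptions: the list of significant pairs is hypothetical input, the percentile is taken over the degrees of all positions that occur in at least one significant pair, and the nearest-rank percentile rule is our choice, since the text does not name one.

```python
import math
from collections import Counter


def connectivity_degrees(significant_pairs):
    """Count, for every position, the significant pairs it takes part in."""
    degree = Counter()
    for y1, y2 in significant_pairs:
        degree[y1] += 1
        degree[y2] += 1
    return degree


def significant_sites(significant_pairs, percentile=90):
    """Positions whose connectivity degree exceeds the percentile cut-off."""
    degree = connectivity_degrees(significant_pairs)
    values = sorted(degree.values())
    # Nearest-rank percentile (assumed rule, not specified in the text).
    rank = max(1, math.ceil(percentile / 100 * len(values)))
    cutoff = values[rank - 1]
    return sorted(pos for pos, d in degree.items() if d > cutoff)


# Hypothetical example: position 5 is coupled to ten other positions.
pairs = [(5, k) for k in range(1, 12) if k != 5]
hubs = significant_sites(pairs)
```

Here position 5 has connectivity degree 10 and all other positions have degree 1, so only position 5 exceeds the cut-off.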

Training data set and pre-processing 

Following [5], a redundancy-free set of more than 35000
protein structures is our starting point. This collection
was compiled in Rainer Merkl's lab at the University of
Regensburg. The protein structures were taken from the
Protein Data Bank (http://www.pdb.org/). The PISCES service
[61] was applied to assess proteins on sequence
similarity and equality of 3D data. The related MSAs were
gathered from the HSSP database (http://swift.cmbi.ru.nl/gv/hssp/).

Taking pattern from [12], we filtered every MSA
obtained as follows. First, highly similar and highly dissimilar
sequences were deleted to ensure that the sequence
identity between any two sequences is at least 20%
and no more than 90%. Second, we removed strictly
conserved residue columns, where the percentage of
identical residues is greater than 95%. Third, we eliminated
the residue columns which contain more than 25%
gaps. Finally, we discarded all MSAs with fewer than 125
sequences. More than 17000 MSAs survived the last filtering
step. From this set we randomly chose approximately
1700 MSAs, published in [5], as our training data set.
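The column and size filters of this pre-processing can be sketched as below. The sketch is deliberately partial and rests on assumptions: it omits the sequence-identity filter (the first step), it computes the percentage of identical residues over all rows including gaps, and the gap symbol and function names are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol


def keep_column(column, max_identity=0.95, max_gap_fraction=0.25):
    """Return True if a column is neither strictly conserved nor gap-rich."""
    n = len(column)
    residues = [c for c in column if c != GAP]
    if not residues:
        return False
    gap_fraction = (n - len(residues)) / n
    identity = Counter(residues).most_common(1)[0][1] / n
    return identity <= max_identity and gap_fraction <= max_gap_fraction


def filter_msa(sequences, min_sequences=125):
    """Return the indices of retained columns, or None if the MSA is too small."""
    if len(sequences) < min_sequences:
        return None
    columns = list(zip(*sequences))
    return [i for i, col in enumerate(columns) if keep_column(col)]


# Hypothetical toy alignment; the size threshold is lowered for illustration.
toy_msa = ["AKD-", "AKE-", "ARDL", "AKEL"]
kept = filter_msa(toy_msa, min_sequences=4)
```

In the toy alignment the first column is fully conserved and the last column consists of 50% gaps, so only the two middle columns are retained.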

Setting up the counting matrices $C_{\mathrm{alt}}$ and $C_{\mathrm{null}}$

The entries of the two matrices are frequencies of
pair substitutions calculated from our training data set
described in the foregoing subsection. Informally speaking,
matrix $C_{\mathrm{alt}}$ models the signal, whereas $C_{\mathrm{null}}$ reflects the
background.

In line with [5], we calculated a signal set and a null set
of column pairs. The signal set consists of all $(U, M)$-significant
column pairs, where $M$ ranges over all training
MSAs. The null set consists of sufficiently many column
pairs randomly chosen from every training MSA. For both
the signal set and the null set we computed a symmetric
400 × 400 integer-valued matrix of frequencies of pair
substitutions, $C_{\mathrm{alt}}$ and $C_{\mathrm{null}}$, respectively. To this end, the method used
to compute BLOSUM62 matrices [62] is applied to count
residue pair substitutions in MSA column pairs rather
than residue substitutions in columns.
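The counting step can be illustrated with the following simplified sketch. It only shows the core idea of counting substitutions between amino acid pairs over all pairs of rows; the sequence clustering and weighting used in the BLOSUM construction are left out, and the function name and toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_pair_substitutions(msa, column_pairs):
    """Count substitutions between amino acid pairs over all row pairs."""
    counts = Counter()
    for y1, y2 in column_pairs:
        # One amino acid pair per row for the column pair (y1, y2).
        pairs = [(row[y1], row[y2]) for row in msa]
        for first, second in combinations(pairs, 2):
            counts[(first, second)] += 1
            if first != second:
                counts[(second, first)] += 1  # keep the matrix symmetric
    return counts


# Hypothetical toy alignment with a single column pair (0, 1).
toy_counts = count_pair_substitutions(["AD", "AD", "KE"], [(0, 1)])
```

Running the same routine on the signal set and on the null set would yield sparse versions of the two 400 × 400 counting matrices.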



Gultas etal. BMC Bioinformatics 2014, 15:96 
http://www.biomedcentral.com/1471-2105/15/96 



Figure 6 $p$-value distributions of $Q_{\mathrm{ent}}$- and $Q_{\mathrm{sep}}$-values for human EGFR protein (PDB entry 2J6M). The blue bars illustrate the $p$-value distribution of the $Q_{\mathrm{ent}}$-values and the red bars display the $p$-value distribution of the $Q_{\mathrm{sep}}$-values. Axis labels recovered from the plot: P-values of Q-values; P-values of $F_0$-distributed Q-values.



Computing a doubly stochastic matrix D 

According to [5], a pair $((a_i, a_j), (a_k, a_l))$ of amino acid
pairs is said to be a formal dissimilar compensatory mutation
if the BLOSUM62 scores of both $(a_i, a_k)$ and $(a_j, a_l)$ are
negative.

Using $C_{\mathrm{alt}}$ and $C_{\mathrm{null}}$, we define the matrix $C_{\mathrm{CompMut}}$ by

$$C_{\mathrm{CompMut}}((a_i, a_j), (a_k, a_l)) := \begin{cases} C_{\mathrm{alt}}((a_i, a_j), (a_k, a_l)) & \text{if } \varphi_{\mathrm{CompMut}}((a_i, a_j), (a_k, a_l)) = 1; \\ 0 & \text{otherwise;} \end{cases}$$

where $\varphi_{\mathrm{CompMut}}((a_i, a_j), (a_k, a_l)) = 1$ if and only if either
$(a_i, a_j) = (a_k, a_l)$ or $((a_i, a_j), (a_k, a_l))$ is a formal dissimilar
compensatory mutation and

$$\frac{C_{\mathrm{alt}}((a_i, a_j), (a_k, a_l))}{\sum_{i',j',k',l'} C_{\mathrm{alt}}((a_{i'}, a_{j'}), (a_{k'}, a_{l'}))} > \frac{C_{\mathrm{null}}((a_i, a_j), (a_k, a_l))}{\sum_{i',j',k',l'} C_{\mathrm{null}}((a_{i'}, a_{j'}), (a_{k'}, a_{l'}))}.$$

By normalizing $C_{\mathrm{CompMut}}$, we obtain a symmetric
matrix $P_{\mathrm{CompMut}}$. For $a_i, a_j, a_k, a_l$ ranging over all amino
acids, $P_{\mathrm{CompMut}}((a_i, a_j), (a_k, a_l))$ represents an empirical
probability distribution on pairs of amino acid pairs.

We then calculated the symmetric 400 × 400 matrix

$$S_{\mathrm{CompMut}}((a_i, a_j), (a_k, a_l)) := \log \frac{P_{\mathrm{CompMut}}((a_i, a_j), (a_k, a_l))}{\tilde{P}_{\mathrm{CompMut}}(a_i, a_j) \, \tilde{P}_{\mathrm{CompMut}}(a_k, a_l)},$$

where $\tilde{P}_{\mathrm{CompMut}}(a_i, a_j)$ is the marginal distribution of
$P_{\mathrm{CompMut}}$.

Having set all negative entries of $S_{\mathrm{CompMut}}$ to zero, the
doubly stochastic matrix $D$ is computed by means of the
canonical iterated row-column normalization procedure
[63].

The doubly stochastic matrix $D$ is used to linearly transform
empirical amino acid pair distributions of column pairs.
If the pair distribution is regarded as a 400-dimensional
row vector, matrix $D$ is multiplied from the right. If then,
for example, the resulting distribution is plugged into
Equation 1, column pairs containing formal dissimilar
compensatory mutations whose $D$-transition probability
is relatively large tend to be up-scaled.
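The iterated row-column normalization and the subsequent transformation of a pair distribution can be sketched as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, a strictly positive 3 × 3 toy matrix stands in for the 400 × 400 score matrix, and the sketch assumes that every row and every column contains at least one positive entry.

```python
def sinkhorn(matrix, iterations=1000, tol=1e-12):
    """Alternately normalize rows and columns until all sums are close to 1."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        for row in m:  # row normalization
            s = sum(row)
            for j in range(n):
                row[j] /= s
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        for i in range(n):  # column normalization
            for j in range(n):
                m[i][j] /= col_sums[j]
        if all(abs(sum(row) - 1.0) < tol for row in m):
            break
    return m


def transform(distribution, d):
    """Row vector times matrix: q = p * D."""
    n = len(d)
    return [sum(distribution[i] * d[i][j] for i in range(n)) for j in range(n)]


# Hypothetical non-negative score matrix and empirical distribution.
d = sinkhorn([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 4.0]])
q = transform([0.5, 0.3, 0.2], d)
```

Because the rows of the resulting matrix sum to 1, the transformed vector is again a probability distribution.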

The idea of the subsequent subsections is to design a
model of MSA column pairs that takes formal dissimilar
compensatory mutations, regarded as pair dissimilarities,
as well as pair similarities into account. The challenge is
to implement this in a way such that these two effects
interfere but do not interact. This is necessary since a similarity
relation is transitive, whereas a dissimilarity relation
is not.

Setting up the two counting matrices $C_{\mathrm{ent}}$ and $C_{\mathrm{sep}}$

We set up two significant pair substitution matrices $C_{\mathrm{ent}}$
and $C_{\mathrm{sep}}$ from $C_{\mathrm{alt}}$ and $C_{\mathrm{null}}$, which form the basis of our
new metrics $Q_{\mathrm{ent}}$ and $Q_{\mathrm{sep}}$. The intuition behind $C_{\mathrm{ent}}$ is
that the component-wise BLOSUM62-based pair similarity
is rescaled, whereas $C_{\mathrm{sep}}$ leads to a new amino acid pair
similarity.

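$$C_{\mathrm{ent}}((a_i, a_j), (a_k, a_l)) := \begin{cases} C_{\mathrm{alt}}((a_i, a_j), (a_k, a_l)) & \text{if } \varphi_{\mathrm{ent}}((a_i, a_j), (a_k, a_l)) = 1; \\ 0 & \text{otherwise;} \end{cases}$$

where $\varphi_{\mathrm{ent}}((a_i, a_j), (a_k, a_l)) = 1$ if and only if either
$(a_i, a_j) = (a_k, a_l)$ or the following two conditions are satisfied.
First, the amino acids $a_i$ and $a_k$ as well as the amino
acids $a_j$ and $a_l$ are BLOSUM62-similar. Second,

$$\frac{C_{\mathrm{alt}}((a_i, a_j), (a_k, a_l))}{\sum_{i',j',k',l'} C_{\mathrm{alt}}((a_{i'}, a_{j'}), (a_{k'}, a_{l'}))} > \frac{C_{\mathrm{null}}((a_i, a_j), (a_k, a_l))}{\sum_{i',j',k',l'} C_{\mathrm{null}}((a_{i'}, a_{j'}), (a_{k'}, a_{l'}))}. \qquad (2)$$

$$C_{\mathrm{sep}}((a_i, a_j), (a_k, a_l)) := \begin{cases} C_{\mathrm{alt}}((a_i, a_j), (a_k, a_l)) & \text{if } \varphi_{\mathrm{sep}}((a_i, a_j), (a_k, a_l)) = 1; \\ 0 & \text{otherwise;} \end{cases}$$

where $\varphi_{\mathrm{sep}}((a_i, a_j), (a_k, a_l)) = 1$ if and only if either
$(a_i, a_j) = (a_k, a_l)$ or Equation 2 is satisfied.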

Calculating the two amino acid pair similarity matrices
$A_{\mathrm{ent}}$ and $A_{\mathrm{sep}}$

Recall that a matrix $A$ is positive definite (positive
semi-definite) if there is an orthogonal matrix $U$ (defining
property $U^{-1} = U^{\top}$) such that $U A U^{\top}$ is a diagonal
matrix, where the coefficients in the main diagonal are
strictly positive (non-negative).

Let us call a 400 × 400 matrix $A$ an amino acid pair similarity
matrix if $A$ is positive definite and the entries in the
main diagonal are equal to 1, whereas the off-diagonal elements
$A_{(g,h),(i,j)}$ with $(g,h) \neq (i,j)$ are greater than or equal to
0, but less than 1.

The entries of an amino acid pair similarity matrix $A$ are
interpreted as follows. The closer $A_{(g,h),(i,j)}$ is to 1, the more
similar are the amino acid pairs $(g,h)$ and $(i,j)$.

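Let $C$ be either $C_{\mathrm{ent}}$ or $C_{\mathrm{sep}}$. We define the matrix $B$
entry by entry: $B_{(g,h),(i,j)}$ is the power $C^{\alpha}_{(g,h),(i,j)}$ divided by
a normalizing term that runs over the 20 × 20 amino acid pairs,
where $((g,h),(i,j))$ ranges over all possible 160000 indices
of pairs of amino acid pairs including the main diagonal,
and $\alpha \in (0, 1)$ was set to 0.1 in order to enhance the effect
of similarity.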

Since matrix $B$ is not positive definite in every case,
we finally set

$$A := B^{\top} B, \qquad (3)$$

which is justified by the transitivity of similarity. That
way the amino acid pair similarity matrices $A_{\mathrm{ent}}$ and $A_{\mathrm{sep}}$
are obtained from the counting matrices $C_{\mathrm{ent}}$ and $C_{\mathrm{sep}}$,
respectively.
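The construction of a similarity matrix from a counting matrix can be sketched as below. The normalization used here is an assumption made for illustration: every column of $B$ is scaled to unit Euclidean norm, which is one choice that guarantees a unit diagonal of $B^{\top} B$ and need not coincide with the normalization of the original method. The function name and the 3 × 3 toy matrix are hypothetical as well.

```python
import math


def similarity_matrix(c, alpha=0.1):
    """Build A = B^T B from a non-negative counting matrix C."""
    n = len(c)
    powered = [[float(x) ** alpha if x > 0 else 0.0 for x in row] for row in c]
    # Assumed normalization: scale every column of B to unit Euclidean norm.
    norms = [math.sqrt(sum(powered[i][j] ** 2 for i in range(n)))
             for j in range(n)]
    b = [[powered[i][j] / norms[j] for j in range(n)] for i in range(n)]
    # A[i][j] is the inner product of columns i and j of B.
    return [[sum(b[k][i] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical toy counting matrix (3 x 3 instead of 400 x 400).
a = similarity_matrix([[50, 4, 0], [4, 30, 1], [0, 1, 20]])
```

Since $A$ is a Gram matrix of the columns of $B$, it is positive semi-definite by construction; with unit-norm, pairwise non-parallel columns its diagonal equals 1 and its off-diagonal entries lie in $[0, 1)$.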

Amino acid pair similarity matrices generalize amino 
acid similarity matrices used by Johansson et al. [24] for 
evaluating amino acid conservation. 

Modeling MSA column pairs and single columns by means 
of density matrices 

Let $(Y_1, Y_2)$ be a column pair of a multiple sequence alignment,
let $(\hat{p}(i,j))_{i,j=1,2,\dots,20}$ be the empirical amino acid pair
distribution in these columns, let $(\hat{q}(i,j))_{i,j=1,2,\dots,20}$ be the
linear transform of $(\hat{p}(i,j))_{i,j=1,2,\dots,20}$ by the doubly stochastic
matrix $D$, and let $A$ be an amino acid pair similarity
matrix.

Recall that the trace of a matrix is the sum of its
coefficients in the main diagonal.

Taking pattern from quantum mechanics, we model
column pair $(Y_1, Y_2)$ by a positive semi-definite 400 × 400
matrix the trace of which is equal to 1, a so-called density
matrix. Regarding the two distributions $(\hat{p}(i,j))_{i,j=1,2,\dots,20}$
and $(\hat{q}(i,j))_{i,j=1,2,\dots,20}$ as 400 × 400 diagonal matrices the
main diagonals of which are formed by the probabilities
$\hat{p}(i,j)$ and $\hat{q}(i,j)$, respectively, we integrate the classical
model into the quantum-mechanics-based one.

Generalizing the approach for amino acids used in [24]
to amino acid pairs, our density matrices are of the shape

$$\rho(\hat{r}, A) := \left( \sqrt{\hat{r}(g,h)} \; A_{(g,h),(i,j)} \; \sqrt{\hat{r}(i,j)} \right)_{g,h,i,j = 1,2,\dots,20}, \qquad (4)$$

where $\hat{r}(i,j)$ is either $\hat{p}(i,j)$ or $\hat{q}(i,j)$ ($i,j = 1, 2, \dots, 20$). Using
this notation, the diagonal density matrices considered
in the preceding paragraph are equal to some $\rho(\hat{r}, \mathbf{1})$,
where $\mathbf{1}$ is the 400 × 400 identity matrix.

In this study, we regard individual MSA columns
only as components of column pairs. In the classical
case, where MSA column pair $(Y_1, Y_2)$ is modeled by an
MSA-dependent amino acid pair distribution $\hat{r}$ (either
$(\hat{p}(i,j))_{i,j=1,2,\dots,20}$ or some derivative), the columns $Y_1$ and
$Y_2$ are represented by the corresponding marginals $\hat{r}_1$ and
$\hat{r}_2$ of $\hat{r}$.

In quantum information science, the counterparts of the
marginals $\hat{r}_1$ and $\hat{r}_2$ of $\hat{r}$ are the partial traces $\mathrm{tr}_2(\rho)$ and
$\mathrm{tr}_1(\rho)$ of $\rho$. They are 20 × 20 density matrices defined by

$$(\mathrm{tr}_1(\rho))_{ij} := \sum_{k=1}^{20} \rho_{(k,i),(k,j)}, \qquad (\mathrm{tr}_2(\rho))_{ij} := \sum_{k=1}^{20} \rho_{(i,k),(j,k)},$$

where $i,j = 1, 2, \dots, 20$. As opposed to the indices of
the marginals, matrix $\mathrm{tr}_1(\rho)$ models column $Y_2$, whereas
matrix $\mathrm{tr}_2(\rho)$ represents column $Y_1$.
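The density matrix of Equation 4 and its two partial traces can be sketched as follows. The sketch uses an alphabet of size $k = 2$ instead of 20, stores the pair $(g, h)$ at the flat index $g \cdot k + h$, and uses hypothetical function names; all of these are illustrative assumptions.

```python
import math


def density_matrix(r, a):
    """rho(r, A)[x][y] = sqrt(r[x]) * A[x][y] * sqrt(r[y]) (Equation 4)."""
    n = len(r)
    return [[math.sqrt(r[x]) * a[x][y] * math.sqrt(r[y]) for y in range(n)]
            for x in range(n)]


def partial_traces(rho, k):
    """Return (tr1, tr2) of a (k*k) x (k*k) matrix.

    The pair (g, h) is stored at flat index g * k + h. tr1 sums over the
    first component and therefore describes the second column; tr2 sums
    over the second component and describes the first column.
    """
    tr1 = [[sum(rho[m * k + i][m * k + j] for m in range(k)) for j in range(k)]
           for i in range(k)]
    tr2 = [[sum(rho[i * k + m][j * k + m] for m in range(k)) for j in range(k)]
           for i in range(k)]
    return tr1, tr2


# Hypothetical pair distribution over (0,0), (0,1), (1,0), (1,1).
r = [0.4, 0.1, 0.2, 0.3]
identity = [[1.0 if x == y else 0.0 for y in range(4)] for x in range(4)]
tr1, tr2 = partial_traces(density_matrix(r, identity), k=2)
```

With the identity matrix as similarity matrix, the diagonals of the two partial traces reduce to the classical marginals: (0.6, 0.4) for the second column and (0.5, 0.5) for the first column.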

Defining our two new metrics $Q_{\mathrm{ent}}$ and $Q_{\mathrm{sep}}$

To begin with, we define the von Neumann entropy
$\mathrm{VNE}(\rho)$ of a diagonal density matrix $\rho$ as the Shannon
entropy of its main diagonal coefficients regarded as a
probability distribution.

The crucial property of a density matrix $\rho$ is that
there exists an orthogonal matrix $U$ such that $U \rho U^{\top}$ is
a diagonal density matrix, where the diagonal elements
are uniquely determined up to their order. Thus we are
justified to finally define

$$\mathrm{VNE}(\rho) := \mathrm{VNE}(U \rho U^{\top}), \qquad (5)$$

where $U$ is an orthogonal matrix diagonalizing $\rho$ in the way
just mentioned.

In principle, the following holds true. The larger
the off-diagonal coefficients of the similarity matrix $A$,
the smaller the von Neumann entropy of the density
matrix according to Equation 4 compared with the Shannon
entropy of the probability distribution $(\hat{r}(i,j))_{i,j=1,2,\dots,20}$.
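As a minimal worked illustration of this effect (a toy two-symbol case that is not taken from the study, with entropies measured in bits as an assumption), consider the uniform distribution $r = (1/2, 1/2)$ and a $2 \times 2$ similarity matrix with off-diagonal entry $a \in [0, 1)$. The construction of Equation 4 then gives

$$\rho = \frac{1}{2} \begin{pmatrix} 1 & a \\ a & 1 \end{pmatrix}, \qquad \lambda_{\pm} = \frac{1 \pm a}{2},$$

$$\mathrm{VNE}(\rho) = -\lambda_{+} \log_2 \lambda_{+} - \lambda_{-} \log_2 \lambda_{-} \;\le\; 1 = H(r),$$

with equality only for $a = 0$. For $a = 0.5$ the eigenvalues are $0.75$ and $0.25$, so that $\mathrm{VNE}(\rho) \approx 0.811$, which is below the Shannon entropy of 1.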

In order to compare two density matrices $\rho$ and $\sigma$ of
the same dimension, we make use of the quantum
Jensen-Shannon divergence:

$$\mathrm{QJSD}(\rho \,\|\, \sigma) := \mathrm{VNE}\left((\rho + \sigma)/2\right) - \left(\mathrm{VNE}(\rho) + \mathrm{VNE}(\sigma)\right)/2. \qquad (6)$$

It can be shown that $0 \le \mathrm{QJSD}(\rho \,\|\, \sigma) \le 1$, where 0 is
attained if and only if the two density matrices $\rho$ and $\sigma$ are
equal. As opposed to the case of Equation 1, we have thus
avoided a normalization.
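Equations 5 and 6 can be sketched in plain Python as follows. The sketch assumes real symmetric density matrices and base-2 logarithms, and it diagonalizes with a simple cyclic Jacobi rotation scheme chosen only to keep the example free of external libraries; for 400 × 400 matrices a dedicated linear algebra routine would be the practical choice. All function names are hypothetical.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=100, tol=1e-24):
    """Eigenvalues of a real symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that annihilates the entry a[p][q].
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def von_neumann_entropy(rho):
    """Shannon entropy (in bits) of the eigenvalues of rho (Equation 5)."""
    return -sum(x * math.log2(x) for x in symmetric_eigenvalues(rho) if x > 1e-12)


def qjsd(rho, sigma):
    """Quantum Jensen-Shannon divergence (Equation 6)."""
    n = len(rho)
    mix = [[(rho[i][j] + sigma[i][j]) / 2 for j in range(n)] for i in range(n)]
    return von_neumann_entropy(mix) - (
        von_neumann_entropy(rho) + von_neumann_entropy(sigma)) / 2


# Toy density matrices: two orthogonal pure states and one mixed state.
up = [[1.0, 0.0], [0.0, 0.0]]
down = [[0.0, 0.0], [0.0, 1.0]]
mixed = [[0.5, 0.25], [0.25, 0.5]]
```

For the two orthogonal states the divergence attains its maximum of 1, and for identical arguments it is 0.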

We are now in a position to define our two new metrics
for a certain column pair of a given MSA. As before, the
amino acid pair distribution $\hat{q}$ is given by $\hat{p} \cdot D$, where $D$ is
the 400 × 400 doubly stochastic matrix described above,
$\hat{p}$ is the empirical pair distribution of these two columns,
and $\mathbf{1}$ is the 400 × 400 identity matrix.

Then our first metric $Q_{\mathrm{ent}}$ is defined by

$$Q_{\mathrm{ent}} := \mathrm{QJSD}\left(\rho(\hat{q}, A_{\mathrm{ent}}) \,\|\, \rho(\hat{p}, \mathbf{1})\right) \qquad (7)$$

(see Equation 4). This metric measures the difference
between a density matrix combining rescaled amino acid
pair similarity with dissimilar compensatory mutations
and the empirical amino acid pair distribution. The index
"ent" indicates that here we make use of quantum entanglement,
which in turn is a major resource of quantum
information science. (Entangled 400 × 400 density matrices
are those that cannot be represented as a convex
combination of Kronecker products of 20 × 20 density
matrices. Note that the Kronecker product of density
matrices is the analog of the classical product of probability
distributions.)
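Our second new metric $Q_{\mathrm{sep}}$ is given by

$$Q_{\mathrm{sep}} := \mathrm{QJSD}\left(\mathrm{tr}_1(\rho(\hat{p}, A_{\mathrm{sep}})) \,\|\, \mathrm{tr}_2(\rho(\hat{p}, A_{\mathrm{sep}}))\right). \qquad (8)$$

The density operator $\rho(\hat{p}, A_{\mathrm{sep}})$ is entangled. However,
before finally calculating the metric, we separate
the columns of the pair by applying the two partial trace
operators.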

Using the example of the human EGFR protein (PDB
entry 2J6M), Figure 6 illustrates that the method we
developed in [5] to determine significant column pairs is
well applicable to both $Q_{\mathrm{ent}}$ and $Q_{\mathrm{sep}}$. The results presented
in this work show that $Q_{\mathrm{ent}}$ as well as $Q_{\mathrm{sep}}$ are
powerful co-evolutionary column pair metrics.

Discussion 

Grosse et al. observed in [64] that the Jensen-Shannon 
divergence (JSD) can be interpreted as mutual informa- 
tion between two (or more) random sources in a special 
setting particularly appropriate to discriminate between 
these sources. This is what we need when it comes 
to predicting important protein sites in an MSA-based 
approach. It might explain the findings of Capra and Singh 
[22] on the predictive power of JSD. These two arti- 
cles encouraged us to utilize quantum Jensen-Shannon 
divergence (QJSD) in this study. As a side effect, a normal- 
ization is not necessary, since quantum Jensen-Shannon 
divergence, like its classical counterpart, ranges over the 
real interval [0, 1]. 

Several studies have confirmed that detecting
coupled MSA columns is extremely useful in the prediction
of important protein sites (see e.g. [4-6,10-13,65-70]).
When using information-theoretic metrics, it is
reasonable to incorporate amino acid pair
dissimilarity as well as amino acid similarity in a consistent
way such that similarity decreases entropy, whereas
dissimilarity increases it. This kind of consistency is important,
since entropy is the fundamental building block for
most of those metrics. In particular, the Jensen-Shannon
divergence between two probability mass functions (pmfs)
$p$ and $q$ equals $H\left(\tfrac{1}{2}(p + q)\right) - \tfrac{1}{2}\left(H(p) + H(q)\right)$.
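For comparison with the quantum case, the classical Jensen-Shannon divergence stated above can be sketched in a few lines. Base-2 logarithms are assumed so that the divergence lies in $[0, 1]$, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a probability mass function."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd(p, q):
    """Jensen-Shannon divergence H((p + q) / 2) - (H(p) + H(q)) / 2."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mixture) - (entropy(p) + entropy(q)) / 2


# Two pmfs with disjoint support reach the maximum; identical pmfs give 0.
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
identical = jsd([0.3, 0.7], [0.3, 0.7])
```

This is the special case of the quantum divergence in which both density matrices are diagonal.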

In [5] an amino acid pair dissimilarity model for com- 
pensatory mutations is presented. A doubly stochastic 
matrix transforms the empirical amino acid pair distribu- 
tion of a column pair. 
Rescaled pair similarity of BLOSUM62-similar pairs
is intended to capture an aspect of coupled MSA column pairs
orthogonal to the phenomenon of dissimilar compensatory
mutations. It models the amino acid pair transition
preferences within those column pairs on average.
As suggested by Caffrey et al. [23] as well as Johansson
et al. [24], it is promising to incorporate them within the
framework of quantum information theory. Therein, density
matrices replace pmfs. The counterpart of the entropy
of a pmf is the von Neumann entropy (VNE) of a density
matrix (see Equation 5). QJSD then corresponds exactly to
JSD (see Equation 6).

The challenge was to complement the model presented
in [5] by additionally incorporating amino acid pair similarity
in a way that the two effects interfere but do not
interact. We model an MSA column pair by means of a
400 × 400 density matrix, rather than by an amino acid pair
distribution. This provides us with the opportunity to
utilize the notion of entanglement, which in turn is a
major resource of quantum information. In our model,
partial traces play the role of the marginals in the classical
case. Pair similarity is reflected by means of positive
definite pair similarity matrices (see Equation 3), where
positive definiteness, which is a key property of density
matrices, can only be ensured by using transitivity of similarity.
Since there is no transitivity of dissimilarity, we kept
dissimilarity apart from that similarity matrix. Instead, we
carried over the CMF dissimilarity model of [5]. Similarity
matrix and transformed amino acid pair distribution are
joined together by means of Equation 4 in the final step
of our density matrix design. That way we minimize the
interaction between the two effects of dissimilarity and
similarity.

In order to eliminate the noise and to define an MSA- 
dependent threshold for significant column pairs, we fol- 
lowed the line of [5]. The model presented there seems 
to be of universal applicability. The same is true for the 
connectivity degree model introduced in [12] and further 
developed in [5]. Combining them results in a reliable and 
robust method to determine significant residues. 

The results we present in this study show that the vast
majority of QCMF-significant residue sites are closely
related to functionality and structural stability of both the
human EGFR and GCK proteins. 10 significant residue
sites in EGFR and 19 significant sites in GCK are established
as functionally important since they are directly
located at or close to catalytic sites, allosteric sites and
binding sites, which are crucial for maintaining protein
functions and for understanding the underlying molecular
mechanism (see Figures 1, 2, 3 and 4). Additionally, 2 significant
sites in EGFR and 8 significant sites in GCK (three
of them are also in contact with allosteric sites in GCK)
are related to disease-associated nsSNP regions of both
proteins. As has been noted in [5], most disease-causing
mutations at these positions in corresponding sequences
destroy structural features of proteins, thus affecting protein
stability and often resulting in loss of protein function.

Although the importance of almost all QCMF-significant
sites is verified through essential sites of
both human proteins, there are still eight and two unconfirmed
significant sites in the EGFR and GCK proteins,
respectively, which do not fall into essential sites. It is
interesting to note that some of these unconfirmed sites
are also referred to as significant by CMF [5]. We therefore
believe that most of these unconfirmed sites identified
by our present method may be important for the
function and structural stability of both proteins, notwithstanding
the absence of previous experimental data. A
further comparison reveals that the overlaps between
the results of the QCMF method and the CMF method
are quite low, indicating that both methods detect considerably
different sets of residue sites as functionally
and structurally important. The comparison results
clearly show that, by considering similar and dissimilar
amino acid signals simultaneously, our present method
is more sensitive to catalytic, allosteric and binding sites,
while the previous method, by focusing only on dissimilar
signals, deals successfully with nsSNP positions in
proteins.

The final comparison between QCMF and CMF on the
EGFR and GCK proteins is made by inspecting several
connectivity degree cut-offs. We initially set the cut-off to the
90th percentile, at which CMF reaches its maximal MCC
value. Going through all $n$-th percentiles for $n =
80, 81, \dots, 99$, QCMF reaches its maximal MCC value of
0.231 for $n = 88$. Our findings can be summarized as
follows. On the one hand, QCMF shows a better performance
than CMF in identifying important residue sites.
On the other hand, QCMF complements CMF. This is
because the QCMF method is more
information-rich than the CMF method: QCMF simultaneously
uses similar and dissimilar amino acid pair signals,
whereas CMF focuses only on amino acid pair
dissimilarity.

To confirm the educated guess that QCMF complements
conventional methods both from information theory
and statistics, we applied QCMF, CMF [5], MIp [6]
and PSICOV [18] to the 153 MSAs described at the beginning
of the Results section. In sum, each of these methods
detects different residue pairs as important, where the
pairwise overlap is bounded from above by 10%. The reason
is that the four methods model different
aspects of amino acid pair co-evolution. Consequently,
they carry distinct information.

To further improve the specificity of QCMF, it is promising
to combine its quantum-information-theory-based
framework with the direct pair distribution derived in
DCA (see e.g. [15] or [16]).
Conclusions

In this work, we report a new method, QCMF, apply- 
ing principles of quantum information theory. In contrast 
to the previous method CMF which focused on dissim- 
ilar amino acid signals, QCMF simultaneously models 
similar and dissimilar amino acid pair signals in the detec- 
tion of functionally or structurally important sites. QCMF 
includes two metrics based on quantum Jensen-Shannon 
divergence. While the first metric measures compen- 
satory mutations between pairs of columns, the second 
metric considers the sequence conservation of columns. 
Results show that QCMF reaches an improved performance
in identifying important sites from MSAs and that it
predicts a quite different set of residue sites as functionally
and structurally important (in comparison to the previous
method). Further, results indicate that the residue
sites found by QCMF are more sensitive to catalytic sites,
allosteric sites and binding sites than those found by the
previous method. On top of that, a pairwise comparison
with existing methods shows that QCMF is complementary
to them when it comes to predicting co-evolving
residue site pairs.